-
Abstract: Neuroimaging studies, such as the Human Connectome Project (HCP), often collect multifaceted data to study the human brain. However, these data are often analyzed in a pairwise fashion, which can hinder our understanding of how different brain-related measures interact. In this study, we analyze the multi-block HCP data using data integration via analysis of subspaces (DIVAS). We integrate structural and functional brain connectivity, substance use, cognition, and genetics in an exhaustive five-block analysis. This gives rise to the important finding that genetics is the single data modality most predictive of brain connectivity, aside from brain connectivity itself. Nearly 14% of the variation in functional connectivity (FC) and roughly 12% of the variation in structural connectivity (SC) is attributed to shared spaces with genetics. Moreover, investigations of shared space loadings provide interpretable associations between particular brain regions and drivers of variability. Novel Jackstraw hypothesis tests are developed for the DIVAS framework to establish statistically significant loadings. For example, in the (FC, SC, and substance use) subspace, these novel hypothesis tests highlight largely negative functional and structural connections, suggesting the brain’s role in physiological responses to increased substance use. Our findings are validated on genetically relevant subjects not studied in the main analysis.
Free, publicly accessible full text available June 16, 2026.
-
High intratumoral heterogeneity is thought to be a poor prognostic indicator. However, the source of heterogeneity may also be important, as genomic heterogeneity is not always reflected in histologic or ‘visual’ heterogeneity. We aimed to develop a predictor of histologic heterogeneity and evaluate its association with outcomes and molecular heterogeneity. We used VGG16 to train an image classifier to identify unique, patient-specific visual features in 1655 breast tumors (5907 core images) from the Carolina Breast Cancer Study (CBCS). Extracted features for images, as well as the epithelial and stromal image components, were hierarchically clustered, and visual heterogeneity was defined as a greater distance between images from the same patient. We assessed the association between visual heterogeneity, clinical features, and DNA-based molecular heterogeneity using generalized linear models, and we used Cox models to estimate the association between visual heterogeneity and tumor recurrence. Basal-like and ER-negative tumors were more likely to have low visual heterogeneity, as were the tumors from younger and Black women. Less heterogeneous tumors had a higher risk of recurrence (hazard ratio = 1.62, 95% confidence interval = 1.22–2.16), and were more likely to come from patients whose tumors were composed of only one subclone or had a TP53 mutation. Associations were similar regardless of whether the image was based on stroma, epithelium, or both. Histologic heterogeneity adds complementary information to commonly used molecular indicators, with low heterogeneity predicting worse outcomes. Future work integrating multiple sources of heterogeneity may provide a more comprehensive understanding of tumor progression.
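The idea of scoring heterogeneity by the distance between images from the same patient can be illustrated with a minimal sketch. It assumes each image has already been reduced to a numeric feature vector and uses the mean pairwise Euclidean distance as the per-patient score; the function name `visual_heterogeneity`, the choice of distance, and the toy vectors are all assumptions for illustration, not the study's actual pipeline (which clusters VGG16 features hierarchically).

```python
from itertools import combinations
from math import dist
from statistics import mean


def visual_heterogeneity(features_by_patient):
    """Map each patient ID to the mean pairwise Euclidean distance
    between that patient's image feature vectors (assumed score)."""
    scores = {}
    for patient, vectors in features_by_patient.items():
        pairs = list(combinations(vectors, 2))
        # A patient with a single image has no within-patient distance.
        scores[patient] = mean(dist(a, b) for a, b in pairs) if pairs else None
    return scores


# Toy feature vectors (hypothetical, not CBCS data).
toy = {
    "patient_A": [(0.0, 0.0), (3.0, 4.0)],
    "patient_B": [(1.0, 1.0), (1.0, 1.0), (1.0, 2.0)],
}
scores = visual_heterogeneity(toy)
```

Under this assumed score, a larger value means the patient's images look less alike, so `patient_A` would count as more heterogeneous than `patient_B`.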
-
Abstract: Model systems are an essential resource in cancer research. They simulate effects that we can extrapolate to humans, but carry a risk of inaccurately representing human biology. This inaccuracy can lead to inconclusive experiments or misleading results, underscoring the need for an improved process for translating model system findings into human-relevant data. We present a process for applying joint dimension reduction (jDR) to horizontally integrate gene expression data across model systems and human tumor cohorts. We then use this approach to combine human TCGA gene expression data with data from human cancer cell lines and mouse model tumors. By identifying the aspects of genomic variation that act jointly across cohorts, we demonstrate how predictive modeling and clinical biomarkers from model systems can be improved.
-
Abstract
Background: Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy because aggregation at those two levels is too crude.
Results: We avoid the crude approximations entailed by such aggregation by proposing an independent Poisson distribution (IPD) at each individual entry in the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation as Departures from a simple homogeneous IPD (DIPD) to capture the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments using real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that are missed or can only be found by careful parameter tuning using conventional methods.
Conclusions: This new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters; (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson.
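To make the per-entry Poisson idea concrete, here is a minimal sketch of one way departures from a homogeneous independent Poisson model could be computed. It assumes the homogeneous rate for each entry is (row total × column total) / grand total and measures the departure as a Pearson-style residual; both choices, and the function name `poisson_departures`, are assumptions for illustration and are not taken from the scpoisson package or the paper's exact DIPD definition.

```python
from math import sqrt


def poisson_departures(counts):
    """Return Pearson-style residuals of each count from an assumed
    homogeneous rate: row_total * col_total / grand_total."""
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    total = sum(row_tot)
    if total == 0:
        raise ValueError("count matrix has no observations")
    departures = []
    for i, row in enumerate(counts):
        out_row = []
        for j, x in enumerate(row):
            lam = row_tot[i] * col_tot[j] / total  # assumed Poisson rate
            # An all-zero row or column gives lam == 0; report no departure.
            out_row.append((x - lam) / sqrt(lam) if lam > 0 else 0.0)
        departures.append(out_row)
    return departures


# Toy gene-by-cell counts (hypothetical): two genes, two cells.
toy_counts = [[4, 0], [0, 4]]
toy_departures = poisson_departures(toy_counts)
```

In the toy matrix every entry has the same assumed rate, so the positive and negative residuals mark which gene-cell pairs sit above or below the homogeneous baseline; a clustering step could then operate on this residual matrix instead of the raw counts.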
-
In The Cancer Genome Atlas (TCGA) data set, there are many interesting nonlinear dependencies between pairs of genes that reveal important relationships and subtypes of cancer. Such genomic data analysis requires a rapid, powerful, and interpretable detection process, especially in a high-dimensional environment. We study the nonlinear patterns among the expression of pairs of genes from TCGA using a powerful tool called Binary Expansion Testing. We find many nonlinear patterns, some of which are driven by known cancer subtypes and others of which are novel.
